Doubly Stochastic Variational Inference for Deep Gaussian Processes
Gaussian processes (GPs) are a good choice for function approximation as they
are flexible, robust to over-fitting, and provide well-calibrated predictive
uncertainty. Deep Gaussian processes (DGPs) are multi-layer generalisations of
GPs, but inference in these models has proved challenging. Existing approaches
to inference in DGP models assume approximate posteriors that force
independence between the layers, and do not work well in practice. We present a
doubly stochastic variational inference algorithm, which does not force
independence between layers. With our method of inference we demonstrate that a
DGP model can be used effectively on data ranging in size from hundreds to a
billion points. We provide strong empirical evidence that our inference scheme
for DGPs works well in practice in both classification and regression. Comment: NIPS 2017
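The key computational step described here, propagating samples through each layer's sparse variational posterior rather than assuming independence between layers, can be illustrated with a minimal NumPy sketch. The kernel, layer sizes, and variational parameters below are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (NumPy only) of the sampling step behind doubly stochastic
# variational inference for a DGP: each layer keeps a sparse-GP variational
# posterior, and samples are propagated *through* the layers so that the
# approximation does not force independence between them.
# Shapes, kernel, and variational parameters are illustrative placeholders.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d / lengthscale ** 2)

class SparseGPLayer:
    """One DGP layer with inducing inputs Z and variational q(u) = N(m, S)."""
    def __init__(self, Z, m, S):
        self.Z, self.m, self.S = Z, m, S

    def marginal(self, X):
        """Marginals of q(f(X)) after analytically marginalising u."""
        Kzz = rbf(self.Z, self.Z) + 1e-6 * np.eye(len(self.Z))
        Kxz = rbf(X, self.Z)
        A = np.linalg.solve(Kzz, Kxz.T).T          # Kxz Kzz^{-1}
        mean = A @ self.m
        var = (rbf(X, X).diagonal()
               - np.einsum('nm,nm->n', A, Kxz)
               + np.einsum('nm,mk,nk->n', A, self.S, A))
        return mean, np.maximum(var, 1e-10)

    def sample(self, X, rng):
        """Reparameterised sample of f(X); this randomness is what couples layers."""
        mean, var = self.marginal(X)
        return mean + np.sqrt(var) * rng.standard_normal(len(X))

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
layers = [SparseGPLayer(Z=rng.uniform(-2, 2, (10, 1)),
                        m=rng.standard_normal(10) * 0.1,
                        S=0.1 * np.eye(10)) for _ in range(2)]

# Propagate a sample through the hierarchy: layer 2 sees a *sample* from layer 1,
# so its marginal depends on layer 1's randomness (no forced independence).
f = X
for layer in layers:
    f = layer.sample(f, rng)[:, None]
print(f.shape)  # (50, 1): one Monte Carlo sample of the DGP output
```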
Deep Gaussian Processes: Advances in Models and Inference
Hierarchical models are certainly in fashion these days. It seems difficult to navigate the field of machine learning without encountering `deep' models of one sort or another. The popularity of the deep learning revolution has been driven by some striking empirical successes, prompting both intense rapture and intense criticism. The criticisms often centre around the lack of model uncertainty, leading to sometimes drastically overconfident predictions. Others point to the lack of a mechanism for incorporating prior knowledge, and the reliance on large datasets. A widely held hope is that a Bayesian approach might overcome these problems.
The deep Gaussian process presents a paradigm for building deep models from a Bayesian perspective. A Gaussian process is a prior for functions. A deep Gaussian process uses several Gaussian process functions and combines them hierarchically through composition (that is, the output of one is the input to the next). The deep Gaussian process promises to capture the compositional nature of deep learning while mitigating some of the disadvantages through a Bayesian approach.
The thesis develops deep Gaussian process modelling in a number of ways. The model is first interpreted differently from previous work, not as a `hierarchical prior' but as a factorized prior with an hierarchical likelihood. Mean functions are suggested to avoid issues of degeneracy and to aid initialization. The main contribution is a new method of inference that avoids the burden of representing the function values directly through an application of sparse variational inference. This method scales to arbitrarily large data and is shown to work well in practice through experiments.
The use of variational inference recasts (approximate) inference as optimization of Gaussian distributions. This optimization has an exploitable geometry via the natural gradient. The natural gradient is shown to be advantageous for single layer non-conjugate models, and for the (final layer of a) deep Gaussian process model.
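The natural-gradient identity at work here can be illustrated with a minimal sketch on a toy conjugate model (not the thesis code): the natural gradient with respect to the natural parameters of a Gaussian equals the ordinary gradient with respect to its expectation parameters, and in the conjugate case a single step of size one lands exactly on the posterior.

```python
# Minimal sketch of a natural-gradient step for a Gaussian variational
# distribution q(theta) = N(m, v), using the standard identity that the
# natural gradient in the natural parameters (m/v, -1/(2v)) equals the
# ordinary gradient in the expectation parameters (m, v + m^2).
# Toy conjugate model (illustrative only): prior N(0,1), y_i ~ N(theta, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5
y = rng.normal(1.0, np.sqrt(sigma2), size=20)

def elbo_grads(m, v):
    """d ELBO / dm and d ELBO / dv for the toy conjugate model (closed form)."""
    dm = np.sum(y - m) / sigma2 - m
    dv = -len(y) / (2 * sigma2) - 0.5 + 1.0 / (2 * v)
    return dm, dv

def natgrad_step(m, v, step=1.0):
    dm, dv = elbo_grads(m, v)
    # Gradient w.r.t. expectation parameters (mu1, mu2) = (m, v + m^2):
    dmu1 = dm - 2 * m * dv
    dmu2 = dv
    # Step in the natural parameters (eta1, eta2) = (m/v, -1/(2v)):
    eta1 = m / v + step * dmu1
    eta2 = -1.0 / (2 * v) + step * dmu2
    v_new = -1.0 / (2 * eta2)
    return eta1 * v_new, v_new

m, v = natgrad_step(m=0.0, v=1.0)          # arbitrary starting point
post_prec = 1.0 + len(y) / sigma2          # exact posterior for comparison
print(m, v)
print(np.sum(y) / sigma2 / post_prec, 1.0 / post_prec)  # identical: one step suffices
```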
Deep Gaussian processes can be a model both for complex associations between variables and complex marginal distributions of single variables. Incorporating noise in the hierarchy leads to complex marginal distributions through the non-linearities of the mappings at each layer. The inference required for noisy variables cannot be handled with sparse methods, as sparse methods rely on correlations between variables, which are absent for noisy variables. Instead, a more direct approach is developed, using an importance weighted variational scheme.
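A minimal sketch of an importance-weighted bound for a single noisy latent variable conveys the idea: averaging several importance weights inside the logarithm gives a bound that tightens as the number of samples grows. The model below is a toy nonlinear latent-variable model chosen purely for illustration, not the one developed in the thesis.

```python
# Minimal sketch of an importance-weighted variational bound for one noisy
# latent variable.  Toy model (illustrative only):
#   z ~ N(0, 1),  y | z ~ N(tanh(z), 0.1),  proposal q(z) = N(m, v).
import numpy as np

rng = np.random.default_rng(2)
y = 0.3
m, v = 0.0, 1.0                      # variational proposal q(z) = N(m, v)

def log_norm(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def iw_bound(K, n_outer=2000):
    z = m + np.sqrt(v) * rng.standard_normal((n_outer, K))
    log_w = (log_norm(y, np.tanh(z), 0.1)     # log p(y | z)
             + log_norm(z, 0.0, 1.0)          # + log p(z)
             - log_norm(z, m, v))             # - log q(z)
    # log (1/K) sum_k w_k, computed stably, then averaged over outer samples
    return np.mean(np.logaddexp.reduce(log_w, axis=1) - np.log(K))

print(iw_bound(1), iw_bound(5), iw_bound(50))  # non-decreasing in K (in expectation)
```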
Orthogonally Decoupled Variational Gaussian Processes
Gaussian processes (GPs) provide a powerful non-parametric framework for
reasoning over functions. Despite this appealing theory, their superlinear
computational and memory complexities have presented a long-standing challenge.
State-of-the-art sparse variational inference methods trade modeling accuracy
against complexity. However, the complexities of these methods still scale
superlinearly in the number of basis functions, implying that sparse GP
methods are able to learn from large datasets only when a small model is used.
Recently, a decoupled approach was proposed that removes the unnecessary
coupling between the complexities of modeling the mean and the covariance
functions of a GP. It achieves a linear complexity in the number of mean
parameters, so an expressive posterior mean function can be modeled. While
promising, this approach suffers from optimization difficulties due to
ill-conditioning and non-convexity. In this work, we propose an alternative
decoupled parametrization. It adopts an orthogonal basis in the mean function
to model the residues that cannot be learned by the standard coupled approach.
Therefore, our method extends, rather than replaces, the coupled approach to
achieve strictly better performance. This construction admits a straightforward
natural gradient update rule, so the structure of the information manifold that
is lost during decoupling can be leveraged to speed up learning. Empirically,
our algorithm demonstrates significantly faster convergence in multiple
experiments. Comment: Appearing in NIPS 2018
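The decoupled construction can be sketched as follows, with the caveat that the kernel, basis sizes, and parameter values are illustrative rather than the paper's implementation: a standard inducing basis carries both mean and covariance, and additional mean-only basis functions are projected orthogonally to its span so that they only model the residue.

```python
# Minimal sketch of an orthogonally decoupled posterior mean: a sparse-GP
# basis at inducing inputs Z carries both mean and covariance, while extra
# "mean-only" basis points B add expressiveness through features projected
# orthogonally to the span of the Z basis.  All values are illustrative.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / lengthscale ** 2)

rng = np.random.default_rng(3)
Z = rng.uniform(-3, 3, (15, 1))         # coupled basis: mean and covariance
B = rng.uniform(-3, 3, (200, 1))        # decoupled basis: mean only (can be large)
m_z = rng.standard_normal(15) * 0.1     # q(u) mean on the Z basis
a_b = rng.standard_normal(200) * 0.01   # weights on the orthogonal mean basis

def decoupled_mean(X):
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
    Kxz, Kxb, Kzb = rbf(X, Z), rbf(X, B), rbf(Z, B)
    coupled = Kxz @ np.linalg.solve(Kzz, m_z)
    # Project the B-features onto the orthogonal complement of the Z basis,
    # so the extra term only models residue the coupled part cannot capture.
    Kxb_perp = Kxb - Kxz @ np.linalg.solve(Kzz, Kzb)
    return coupled + Kxb_perp @ a_b

X = np.linspace(-3, 3, 100)[:, None]
print(decoupled_mean(X).shape)   # (100,)
```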
Stochastic Differential Equations with Variational Wishart Diffusions
We present a Bayesian non-parametric way of inferring stochastic differential
equations for both regression tasks and continuous-time dynamical modelling.
The work places particular emphasis on the stochastic part of the differential
equation, also known as the diffusion, which is modelled by means of Wishart processes.
Further, we present a semi-parametric approach that allows the framework to
scale to high dimensions. This naturally leads to modelling both
latent and auto-regressive temporal systems with conditionally heteroskedastic
noise. We provide experimental evidence that modelling diffusion often improves
performance and that this randomness in the differential equation can be
essential to avoid overfitting. Comment: ICML 2020
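The general construction behind a Wishart-process diffusion can be sketched as follows; the drift, kernel, and constants are illustrative placeholders rather than the paper's model. A matrix of independent GP draws F(t) is mapped to a positive-definite diffusion Sigma(t) = L F(t) F(t)^T L^T, which then drives an Euler-Maruyama simulation.

```python
# Minimal sketch of a Wishart-process diffusion: the diffusion matrix of an
# SDE is built as Sigma(t) = L F(t) F(t)^T L^T, where the D x nu matrix F(t)
# collects independent GP draws, so Sigma(t) is positive (semi-)definite and
# varies smoothly in time.  Drift, kernel, and constants are illustrative.
import numpy as np

rng = np.random.default_rng(4)
D, nu, T, dt = 2, 3, 200, 0.01
t = np.arange(T) * dt

# Draw the D*nu entries of F as independent GP samples over the time grid.
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.5 ** 2) + 1e-6 * np.eye(T)
Lk = np.linalg.cholesky(K)
F = (Lk @ rng.standard_normal((T, D * nu))).reshape(T, D, nu)

L = np.array([[0.3, 0.0],
              [0.1, 0.2]])                 # scale matrix of the Wishart process
Sigma = np.einsum('de,tef,tgf,hg->tdh', L, F, F, L)   # L F F^T L^T at each time

# Euler-Maruyama with the time-varying diffusion and a simple mean-reverting drift.
x = np.zeros((T, D))
for i in range(T - 1):
    chol = np.linalg.cholesky(Sigma[i] + 1e-8 * np.eye(D))
    x[i + 1] = x[i] - 0.5 * x[i] * dt + chol @ rng.standard_normal(D) * np.sqrt(dt)
print(x.shape)   # (200, 2): one sample path driven by the Wishart diffusion
```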